ABSTRACT


With the advances in VLSI technology, it will be possible to
fabricate chips with 100,000 to 500,000 gates per chip. However, the
technology to pack more and more elements on a chip has outpaced the
collective knowledge for effective use of chip "real estate". For
example, it is virtually impossible to test high density microcircuits.

This report reviews the existing literature on VLSI technology with
regard to proposed methods to increase reliability and testability.

One of the critical problems of high density microcircuits is the
limited number of I/O pins. The present literature points out the
two types of circuit additions that can improve circuit reliability.

The  report  also  provides  a  list  of  references  for  further  study 
of  Fault-Tolerant  Computing. 
INTRODUCTION


Contemporary integrated circuits contain as many components as the
largest computing systems of 15 to 20 years ago. The 1960's were
the decade of gate level design, the 1970's were the decade of
register-transfer level design and the 1980's will be the decade of
processor-memory-switch (PMS) design. The age of VLSI is here and its
technology is presenting interesting potentials as well as challenges.
The advantages of VLSI include reduction in support cost, improved
reliability and improved fault detection. Some of the challenging issues
include partitioning, fault models and dependencies, efficient use of
redundancy, role of reliability tools, hierarchical complexity,
self-test during operation, redundancy to enhance yield, and self-test at
fabrication time.

As is the case in most design efforts, three primary factors
have been considered and traded off against each other in the design of
computer systems: cost, performance and fault tolerance. In the past,
systems engineers realized that any attempt to improve significantly
any one of these factors while holding another constant meant significant
degradation in the third factor. The advent of VLSI, however, seems
to have affected that situation dramatically. With VLSI, the economics
are different and the rate of increase in cost as a function of added
gates is greatly reduced. Thus, if a system designer wants to
increase fault tolerance by adding circuitry while holding performance
constant (or perhaps even increasing performance), the net increase in
manufacturing cost of the machine will be very small in comparison to
costs for conventional designs of the past. As the era of VLSI and
VHSI circuits emerges, one integrated chip will provide the electronic
functions of former systems. This factor tends to improve physical design
economics, improve performance and, on the surface, improve reliability
and system availability. But problems of initial yield in chip fabrication,
such as increased complexity, cost of testing, etc., dramatically erode the
economic advantages of increased circuit densities. In particular, the
technology to pack more and more active elements on a chip has outpaced
the collective knowledge for systematic and effective use of chip "real
estate". For example, it is virtually impossible to test high density
microcircuits in the laboratory, and certainly not in their operational
environment.

Phase I of this research is to review the literature and investigate
the technological and economic feasibility of functional sub-circuit
partitioning which elevates redundancy to the sub-circuit level (and
beyond) to include higher levels of fault detection and fault isolation
capability on a single chip. The purpose of the literature review is
to ferret out key factors relevant to the design stage of electronic
circuitry that dominantly affect end-user utility, in both the positive
and the negative sense. Another facet of the literature review is to
acquaint the researchers with the immense literature base for electronic
technology applicable to military avionics.
LITERATURE REVIEW


In the past, the design of fault-tolerant computing systems has
been done in an ad hoc manner. The absence of a unified theory of
fault-tolerant computing can be attributed to at least two factors:

1. The high cost of hardware, which in the past has limited the
use of redundancy techniques.

2. A lack of understanding of the basic definitions and goals of
fault-tolerant computing.

Avizienis defined a fault-tolerant computer as one which is free
from hardware and software design faults and can execute its programs
correctly, obtaining correct results within specified time limits,
despite the presence of transient or permanent operational faults.

The first special issue on Fault-Tolerant Computing was published
in the IEEE Transactions on Computers nearly a decade ago. Six
other special issues devoted to this same topic (5,6,7,8,9,10) contain
a representative sample of the research activities that have taken
place in fault-tolerant computing over the past decade.

The most obvious trend is the increasing concern with the
effects of large scale integration on fault-tolerance techniques that
were effective for computers implemented with SSI or MSI circuits.
Different failure modes may be anticipated as the scale of integration
increases; testing and diagnostic procedures that were appropriate
ten years ago may be totally inadequate for the much more highly
integrated circuitry of today.

Fault tolerance has now come to be recognized as a desirable and
in some cases an essential feature of a wide range of computing systems.
Interest in computers capable of long maintenance-free operation has
been matched by interest in low-maintenance or scheduled-maintenance
commercial systems, in flight-control avionic systems that provide
extremely high availability, and in process control and telephone
switching systems. In addition, as more and more memory cells are packed
into a single chip, the number of failure modes increases and the need
for efficient algorithms to detect faults in them becomes more critical.

The concepts of self-checking logic (11) and hybrid redundancy (12)
appear to be two important steps toward a general theory
of fault-tolerant computing. Hybrid redundancy bridges the gap between
static and dynamic redundancy schemes (13), while self-checking design
enables us to distribute the monitoring function among the sub-systems.
Sedmak and Liebergot (14), in describing a design approach for
fault-tolerant general purpose computers implemented with VLSI, noted
that there are significant problems in using some conventional
fault-tolerant techniques in VLSI implementations. Their approach is to
implement all the logic needed to detect faults in a VLSI chip directly
on the chip itself and to design and partition this logic so as to minimize
the possibility that any failure mode is capable both of causing a chip
to malfunction and of simultaneously making it incapable of reporting
this fact.

Fault-Tolerant Aspects

There are several parameters which affect the design of the
fault-tolerant features, as follows:

a. Nearly 100 percent immediate fault detection is necessary.

b. Complete recovery should be effected from transient or
intermittent failures.

c. Rapid fault isolation, without human intervention, will result
in low downtime.

The fault-tolerant aspects of VLSI circuits are fault detection,
isolation and recovery. By the mid 1980's it is expected that speeds
for military avionics circuits will approach 10^9 operations per second,
that the density of circuits will increase, and that the testing methods
will become expensive and complex, if not impossible. Since circuit
densities in the range of 10^5 to 5 x 10^5 gates per chip are within the
reach of VLSI technologies today, testability and reliability will become
dominant issues. There is a large amount of literature on fault-tolerant
computing, beginning with Von Neumann and Moore and Shannon. Of
the recent literature on fault-tolerant computing, a large number of
papers have focused on fault-tolerant memory. Reviews of semiconductor
memory have been given by Limbindcr (17), Riley (18) and Lenke et al. (19).

Several important past studies are pertinent to the background of
this research. Agarwal addresses the problem of detecting faults in
programmable logic arrays (PLAs). He points out that such devices are
vulnerable to a unique class of contact faults, and he develops a PLA
model in which these faults can be represented. Crouzet and Landrault
contend that an LSI circuit can be made self-checking if it is designed
from the outset with that goal in mind and if its various possible
failure modes are well understood. Satish Thatte of TTI proposes
several methods for improving VLSI testability: self-checking, on-chip
testing, partitioning and exhaustive checking, and microdiagnostics.

Lee, Ghani and Heron suggest the use of a recovery cache to
take care of faults due to software. The function of the cache is to
store all data that would be needed to restore a machine to the state
it was in prior to the execution of any defective software module.
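
The recovery cache idea can be illustrated in software. The following
Python sketch is only an illustration under stated assumptions: the
mechanism described by Lee, Ghani and Heron is a hardware device, and the
names used here (RecoveryCache, run_with_recovery, a dictionary standing
in for machine state) are hypothetical. It saves the prior value of each
location the first time that location is written, and writes those values
back if the module fails.

```python
class RecoveryCache:
    """Keeps the prior value of each location written since the checkpoint.

    Hypothetical software model; a dict stands in for machine state.
    """

    _ABSENT = object()  # marks a location that did not exist beforehand

    def __init__(self, state):
        self.state = state  # live state being modified by the module
        self.saved = {}     # original values, recorded on first write only

    def write(self, key, value):
        # Only the first write to a location needs its old value saved.
        if key not in self.saved:
            self.saved[key] = self.state.get(key, self._ABSENT)
        self.state[key] = value

    def commit(self):
        # The module completed correctly; the saved values are not needed.
        self.saved.clear()

    def restore(self):
        # Return every modified location to its value at the checkpoint.
        for key, old in self.saved.items():
            if old is self._ABSENT:
                self.state.pop(key, None)
            else:
                self.state[key] = old
        self.saved.clear()


def run_with_recovery(cache, module):
    """Run a module; undo all of its writes if it raises an exception."""
    try:
        module(cache)
    except Exception:
        cache.restore()
        return False
    cache.commit()
    return True
```

Saving only on the first write keeps the cache small: its size depends on
how many distinct locations a module changes, not on how many writes it
performs.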
A large amount of recent work has dealt with reconfiguration
(manual and automatic) to improve system performance. Mathur and de
Sousa presented a technique which uses configurable NMR. Papers by
Cox and Carroll and Hartwell et al. describe approaches which
involve swapping memory bit planes. Srinivasan's approach combines
triple-modular redundancy (TMR) with matching the address decoder
to the particular faulty array. Mehta et al. also addressed the issue
of internal redundancy, while Carter and Schneider discussed the basic
principles of self-checking circuits.

Arnold showed the importance of having fault detection and
recovery for all the elements of a machine in order to achieve high
reliability. Tanaka et al. described the use of duplication in
some parts of a general purpose computer.
SUMMARY


The critical problem with VLSI circuits is the limited
number of I/O pins. As the number of functions that can be performed
on a chip has increased, the pin-outs have remained basically the same.
This presents problems in that the chips have to be partitioned to make
use of the limited I/O pins. Likewise, the interconnection between chips
has become more difficult because of limited I/O pins.

Since "soft faults" can no longer be isolated by "hard fault"
error detection devices, self-testing methods must be used to improve
diagnostics, and testing must move into the design level. As systems
are currently partitioned into sub-systems, so can VLSI chips be
partitioned into subcircuits. Critical subcircuits may be duplicated
to provide redundant paths automatically after errors are detected.

The device density gain will be traded off for circuit reliability
as we try to pack more and more elements on chips. It is also
envisioned that some redundant circuits will be provided to allow
testability, since high density circuits make it impossible to test
circuits in the laboratory and in the field.

The following are the main methods used to increase reliability and
testability. The first method is called fault masking. This is a
process of masking a fault without really knowing the nature or location
of the fault. Triple modular redundancy (TMR) and two-rail networks
(TRN) are techniques used in fault masking. These designs have a high
redundancy ratio, i.e., the ratio of hardware in the redundant circuit to
that of a nonredundant circuit is 3:1 to 4:1. The second method, called
self-checking, is a process which detects possible error conditions in the
circuit and produces an error signal which can be used to (1) stop
computations, (2) signal manual repair and (3) initiate reconfiguration
of the circuit operation. The simplest form of this is complete
duplication. The redundancy ratio of this type of circuit is greater
than 2:1. Another type of self-checking circuit uses forced parity,
which involves adding hardware to the checking circuit that has the
ability to generate an odd parity error.
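
The difference between the two methods can be shown with a short Python
sketch. It is a minimal illustration only, assuming that each module's
output is an integer word; the function names are hypothetical and do not
come from the literature reviewed. The TMR voter masks a single faulty
module without identifying it, whereas duplication with comparison masks
nothing but raises an error signal.

```python
def majority(a, b, c):
    """Bitwise two-out-of-three majority vote over three output words."""
    return (a & b) | (a & c) | (b & c)


def tmr(outputs):
    """Fault masking: vote over three copies of a module.

    A fault in any single copy is outvoted. The voter does not report
    which copy, if any, was faulty.
    """
    a, b, c = outputs
    return majority(a, b, c)


def duplicate_and_compare(out1, out2):
    """Self-checking by complete duplication.

    Returns (value, error). The error signal is raised whenever the two
    copies disagree, but the fault is not masked: the value may be wrong.
    """
    return out1, out1 != out2
```

For example, tmr((0b1011, 0b1011, 0b0011)) gives 0b1011 even though the
third copy is faulty, while duplicate_and_compare(0b1011, 0b0011) gives
an error signal of True. The three copies plus a voter, or two copies
plus a comparator, account for the redundancy ratios of more than 3:1
and more than 2:1 quoted above.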

Another design technique that is being extensively used by IBM
is called Level-Sensitive Scan Design (LSSD). The basic premise of
this technique is that no feedback loops can be used in the chip.
All loops must be broken down and replaced by serial shift registers/
latches. The latches are then brought out to pins, thereby allowing
access to previously unreachable nodes within the chip. Having
the shift register pin also allows a test pattern to be entered
into the chip register.

From the work done so far, it is apparent that additional searching
of the literature is necessary. A bibliography at the end of the report
offers additional helpful material that has not yet been studied by the
researchers. This list will include additional literature that is being
generated on this topic by various research organizations.
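
The scan-path idea behind LSSD, as described above, can be sketched in
Python. This is a simplified behavioral model under stated assumptions,
not IBM's implementation: real LSSD uses clocked shift-register latches,
and the ScanChain class and its method names are hypothetical. The model
shows how a test pattern is shifted serially into internal latches, how
the combinational logic response is captured, and how that response is
shifted back out through a pin.

```python
class ScanChain:
    """Internal latches connected as one serial shift register."""

    def __init__(self, length):
        self.latches = [0] * length

    def shift(self, scan_in):
        """Shift one bit in at the head; return the bit leaving the tail."""
        scan_out = self.latches[-1]
        self.latches = [scan_in] + self.latches[:-1]
        return scan_out

    def load(self, pattern):
        """Serially enter a test pattern so that latches[i] == pattern[i]."""
        for bit in reversed(pattern):
            self.shift(bit)

    def capture(self, logic):
        """Apply the combinational logic once and latch its response."""
        self.latches = list(logic(self.latches))

    def unload(self):
        """Shift the latch contents out; return them in latch order."""
        bits = [self.shift(0) for _ in range(len(self.latches))]
        return bits[::-1]
```

A test then consists of load, capture and unload, with the unloaded bits
compared against the expected response. Only the scan-in and scan-out
pins are needed regardless of how many latches are in the chain, which is
why the technique suits chips with a limited number of I/O pins.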

Equipment

A Motorola M6809 Development System was purchased under Phase
I of the project. This development system will be used to conduct
tests on chip reliability and testability of various proposed designs.
This equipment will also be used to train graduate engineers, who will
become well versed in VLSI/VHSI principles, including
considerations of economics, scale, speed and circuit utility.